Method of integrating ad hoc camera networks in interactive mesh systems

ABSTRACT

An entertainment system has a first recording device that records digital images and a server that receives the images from the first device, wherein the server, based on data from another source, enhances the images from the first device for display.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Provisional Application No. 61/400,314, which is incorporated by reference as if fully set forth.

FIELD OF INVENTION

This relates to sensor systems used in smartphones and networked cameras, and to methods of meshing multiple camera feeds.

BACKGROUND

Systems such as Flickr, Photosynth, Seadragon, and Historypin work with modern networked cameras (including cameras in phones) to allow for much greater sharing and shared power. Social networks that use location, such as Foursquare, are also well known. Sharing digital images and videos, and creating digital environments from them, is a new digital frontier.

SUMMARY

This disclosure describes a system that incorporates multiple sources of information to automatically create a 3D wireframe of an event that may be used later by multiple spectators to watch the event at home with substantially expanded viewing options.

An entertainment system has a first recording device that records digital images and a server that receives the images from the first device, wherein the server, based on data from another source, enhances the images from the first device for display.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a smart device application and system diagram.

FIG. 2 illustrates multiple smartphones as a sensor mesh.

FIG. 3 illustrates phone users tracking action on field.

FIG. 4 illustrates using network feedback to improve image.

FIG. 5 illustrates smartphone sensors used in the application.

FIG. 6 illustrates smartphone sensors and sound.

FIG. 7 shows a 3D space.

FIG. 8 illustrates alternate embodiments.

FIG. 9 illustrates video frame management.

FIG. 10 illustrates a mesh construction.

FIG. 11 illustrates avatar creation.

FIG. 12 illustrates supplemental information improving an avatar.

FIG. 13 shows an avatar point-of-view.

FIG. 14 shows data flow in the system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Introduction

Time of Flight (ToF) cameras and similar real-time 3D mapping technologies may be used in social digital imaging because they allow a detailed point cloud of vertices representing individuals in the space to be mapped as three-dimensional objects, in much the same way that sonar is used to map underwater geography. Phone and camera makers are using ToF and similar sensors to bring greater fidelity to 3D images.

In addition, virtual sets, avatars, photographic databases, video content, spatial audio, point clouds, and other graphical and digital content enable a medium that blurs the space between real-world documentation, like traditional photography, and virtual space, like video games. Consider, for example, the change from home brochures to online home video tours.

The combination of virtual sets and characters, multiple video sources, location-tagged image and media databases, and three-dimensional vertex data may create a new medium in which it is possible to literally see around corners, interpolating data that no camera was able to record and blending it with other content available in the cloud or within the user's own data. The combination of this content will blend video games and reality in a seamless way.

Using this varied content, viewers will be able to see content that was never recorded in the traditional sense. An avatar of a soccer player might be textured using data from multiple cameras and 3D data from other users. The playing field might be made up of stitched-together pieces of Flickr photographs. Dirt and grass might become textures on 3D models captured from a database.

One of the benefits of this new medium is the ability to place the user in places where cameras weren't placed, for instance, at the level of the ball in the middle of the field.

The density of location-based data should substantially increase over the next decade as companies develop next-generation standards and geocaching becomes automated. In the soccer example above, people's phones and wallets, and even the soccer ball, may send location-based data to enhance the accuracy of the system.

The use of data recombination and filtering to create 3D virtual representations has other applications as well. After the game, players may explore alternate plays by assigning an artificial intelligence (AI) to the opposing team's players and seeing how they react to different player positions and passing strategies.

DESCRIPTION

FIG. 1 illustrates a single element of the larger sensor mesh. A digital recording device 10 contains a camera 11 and internal storage 12. The device connects to a wired or wireless network 13. The network 13 may feed a server 14 where video from the device 10 can be processed and delivered to a network-enabled local monitor 15 or entertainment projection room. This feed may be viewed in multiple locations. Users can comment on the feed and potentially add their own media. The feed can also contain additional information from other sensors 16 in the device 10. These sensors 16 may include GPS, accelerometer, microphone, light sensors, and gyroscopes. All of this information can be processed in a data center with a high degree of efficiency, which creates new options for software.

The feed from the smart device 10 may be optimized for streaming through compression, and it is possible to transmit the data more efficiently using application-specific network protocols. The sensor network may also be able to use multiple feeds from a single location to create a more complete playback scenario. If the optimized network protocol includes metadata from sensors as well as a network time code, then it is possible to integrate multiple feeds offline when network and processor demand is lower. If the streaming video codec includes, along with the smaller frames for network streaming, full-resolution frames that carry edge-detection, contrast, and motion information, then this information can be used to quickly build multiple feeds into a single optimized vertex-based wireframe similar to what might be used in a video game. In this scenario, the cameras/devices 10 fill the role of a motion capture system.
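
By way of a non-limiting illustration, the following Python sketch shows what such a per-frame metadata record might look like; the field names, units, and JSON packaging are assumptions made for illustration, not a defined protocol.

    # Illustrative sketch of a per-frame metadata record of the kind the
    # protocol described above might carry; all names are hypothetical.
    from dataclasses import dataclass
    from typing import Tuple
    import json, time

    @dataclass
    class FrameMetadata:
        device_id: str
        network_timecode: float           # shared clock so feeds can be aligned offline
        gps: Tuple[float, float]          # (latitude, longitude)
        accel: Tuple[float, float, float]
        gyro: Tuple[float, float, float]
        edge_density: float               # summary of an edge-detection pass on the full frame
        motion_energy: float              # summary of frame-to-frame motion

        def to_json(self) -> str:
            return json.dumps(self.__dict__)

    meta = FrameMetadata("phone-01", time.time(), (40.0, -74.0),
                         (0.0, 0.0, 9.8), (0.01, 0.0, 0.0), 0.12, 0.45)
    packet = meta.to_json()  # piggybacked on the compressed live stream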

The system may include appropriate software at the smart-device level, the system level, and the home-computer level. It may also be necessary to have software or a plugin for network-enabled devices such as video game platforms or network-enabled televisions 15. Furthermore, it is possible for a network-enabled camera to provide much of this functionality, and the terms smartphone, smart device, and network-enabled camera are used interchangeably as they relate to the streaming of content to the web.

FIG. 2 illustrates multiple smartphones 20 used by spectators/users watching a soccer game 21. These phones 20 are in multiple locations along the field. All the phones may use an installed application to stream data to a central server 24. In this instance the spectators may be most interested in the players 23, but the action may tend to follow the ball 22.

To configure the cameras for a shared event capture, a user 25 might perform a specific task in the application software, such as aligning the goal at one end of the field 26 with a marker in the application and then panning the camera to the other goal 27 and aligning that goal with a marker in the application. This information helps define the other physical relationships on the field. The configuration may also involve taking pictures of the players tracked in the game. Numbers and other prominent features can be used in the software to name and identify players later in the process.
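
As a hypothetical illustration of how the two goal alignments constrain field geometry: if the compass heading is recorded at each alignment, the angular span between the goals, together with an assumed regulation field length, gives a rough estimate of the user's distance from the field. The sketch below assumes the user stands roughly on the perpendicular bisector of the field's long axis.

    # Hypothetical sketch: estimate a spectator's distance from the field
    # center from the compass headings recorded when each goal was aligned
    # with the on-screen marker.
    import math

    FIELD_LENGTH_M = 105.0  # assumed regulation pitch length

    def distance_from_field(heading_goal_a_deg: float, heading_goal_b_deg: float) -> float:
        span = abs(heading_goal_a_deg - heading_goal_b_deg) % 360.0
        if span > 180.0:
            span = 360.0 - span
        # Half the field subtends half the angular span at the user's position.
        return (FIELD_LENGTH_M / 2.0) / math.tan(math.radians(span / 2.0))

    print(distance_from_field(60.0, 120.0))  # goals 60 degrees apart -> ~90.9 m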

FIG. 3 illustrates a key tendency of video used in large sporting events. During game play, the action tends to follow the ball 31, and users 32 will tend to videotape the players that most interest them, who may be in the action, while other users may follow players 33 not in the action; parents may follow their own children, but their children will tend to follow the ball. Software can evaluate the various streams and determine where the focal point of the event is by considering where the majority of cameras are pointed. It is possible that the software will make the wrong choice (outside the context of a soccer game, magic and misdirection are examples of this, where the eyes follow an empty hand believed to be full), but in most situations the crowd-based data corresponding to what the majority is watching will yield the best edit/picture for later (or even live) viewing. On the subject of live viewing, imagine that a viewer on the other end of a network can choose a perspective to watch live (or recorded), but the default is one following the place where most people are recording.
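
One possible realization of this crowd-focus evaluation, sketched in Python, intersects the viewing rays of the cameras in the ground plane and takes the median intersection point as the focal point; the function names and the use of the median (to discard outlier cameras) are illustrative assumptions.

    # Illustrative sketch of the majority-focus idea: intersect the viewing
    # rays of all cameras in the ground plane and take the median
    # intersection point as the crowd's focal point.
    import math
    from statistics import median

    def ray_intersection(p1, h1_deg, p2, h2_deg):
        """Intersect two 2D rays given origin (x, y) and compass-style heading."""
        d1 = (math.sin(math.radians(h1_deg)), math.cos(math.radians(h1_deg)))
        d2 = (math.sin(math.radians(h2_deg)), math.cos(math.radians(h2_deg)))
        denom = d1[0] * d2[1] - d1[1] * d2[0]
        if abs(denom) < 1e-9:
            return None  # parallel rays never meet
        t = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / denom
        return (p1[0] + t * d1[0], p1[1] + t * d1[1])

    def crowd_focus(cameras):
        """cameras: list of ((x, y), heading_deg); returns median intersection."""
        points = []
        for i in range(len(cameras)):
            for j in range(i + 1, len(cameras)):
                pt = ray_intersection(*cameras[i], *cameras[j])
                if pt:
                    points.append(pt)
        return (median(p[0] for p in points), median(p[1] for p in points))

    # Three sideline cameras all roughly aimed at midfield:
    print(crowd_focus([((0, 0), 90), ((0, 20), 110), ((0, -20), 70)]))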

FIG. 4 illustrates the ability of the system to provide user feedback to improve the quality of the 3D model by helping users shift their positions to improve triangulation. The system can identify a user at one end of the field 35 and a group of users in the middle of the field 36. The system prompts one user 37 to move towards the other end of the field and prompts them to stop when they have moved into a better position 38, so that what is being recorded is optimal for all viewers, i.e., captures the most data.
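
A minimal sketch of this repositioning logic, assuming camera and target positions in a local ground-plane coordinate frame: triangulation degrades as two cameras' viewing rays become parallel, so the system could prompt a user to move until the angle subtended at the target exceeds a threshold. The 20-degree threshold is an assumption.

    # Triangulation of a target degrades as two viewing rays become
    # parallel; prompt a user to move until the subtended angle is wide enough.
    import math

    MIN_BASELINE_ANGLE_DEG = 20.0  # assumed quality threshold

    def subtended_angle_deg(cam_a, cam_b, target):
        ax, ay = cam_a[0] - target[0], cam_a[1] - target[1]
        bx, by = cam_b[0] - target[0], cam_b[1] - target[1]
        cos_t = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

    def should_prompt_move(cam_a, cam_b, target):
        return subtended_angle_deg(cam_a, cam_b, target) < MIN_BASELINE_ANGLE_DEG

    # Two users bunched together barely triangulate the near goal:
    print(should_prompt_move((0, 1), (0, -1), (50, 0)))    # True: ~2.3 degrees
    print(should_prompt_move((0, 1), (40, -30), (50, 0)))  # False: wide baseline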

FIG. 5 illustrates one example of additional information that can be encoded as metadata in the video stream. One phone 41 is at a slight angle. Another phone 42 is being held completely flat. This information can be used as one factor among the large amount of information coming into the servers to improve the 3D map that is created of the field, as each phone captures a different and complementary data stream.

FIG. 6 illustrates a basic stereo phenomenon. There are two phones 51, 52 along the field. A spectator 54 is roughly midway between the two phones, and both phones pick up sound evenly from their microphones. Another spectator 53 is much closer to one phone 51, and the phone that is further away 52 will receive a sound signal at a lower decibel level. The two phones may also be able to pick up stereo pan as the ball 57 is passed from one player 55 to another player 56. A playback system can use the GPS locations of each user to balance the sounds and optimize the playback experience.
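
The following Python sketch illustrates one way such GPS-based balancing might work, weighting each phone's microphone feed in proportion to its distance from the sound source so that a nearby phone does not dominate the mix; the attenuation model and function names are assumptions.

    # Hedged sketch of GPS-based sound balancing: weight each phone's feed
    # by its distance to the source to compensate for attenuation.
    import math

    def balanced_mix(mic_positions, source, samples):
        """mic_positions: [(x, y)], source: (x, y), samples: one float per mic.
        Returns a distance-compensated mono mix."""
        weights = []
        for (mx, my) in mic_positions:
            d = max(math.hypot(mx - source[0], my - source[1]), 1.0)
            weights.append(d)  # boost distant mics to offset attenuation
        total = sum(weights)
        return sum(w / total * s for w, s in zip(weights, samples))

    # Phone 51 is near the spectator, phone 52 far; the far feed is boosted:
    print(balanced_mix([(0, 0), (40, 0)], (5, 0), [0.8, 0.1]))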

FIG. 7 illustrates multiple cameras 62 focused on a single point of action 63. All of this geometry, along with the other sensor-based metadata, is transferred to the network-based server where the content is analyzed. If a publicly accessible image of the soccer field 61 is available, it can also be used along with the phones' GPS data to improve the 3D image.

This composite 3D image may generate the most compelling features of this system. A user watching the feed at home may add additional virtual cameras to the feed. These may even be point-of-view cameras tied to a particular individual 65. The cameras may also be located to give an overhead view of the game.

FIG. 8 illustrates other options available given access to multiple feeds and the ability to spread the feed over multiple GPUs and/or processors. A composite HDR image 71 can be created using multiple full-resolution feeds to create the best possible image. It is also possible to add information beyond that captured by the original imager. This "zoom plus" feature 72 takes the original feed 73 and adds additional information from other cameras 74 to create a larger image. It is also possible, in a similar vein, to stitch together a panoramic video 75 covering multiple screens.
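
As a hedged sketch of the HDR-composite idea, assuming the full-resolution frames from the different phones have already been aligned to a common view: blend the exposures with per-pixel weights that favor well-exposed (mid-range) pixels. This is a simple exposure-fusion stand-in, not the system's defined method.

    # Blend aligned exposures, favoring pixels far from clipping.
    import numpy as np

    def hdr_composite(frames):
        """frames: list of aligned float arrays in [0, 1]; returns a blend
        that favors well-exposed pixels from each feed."""
        stack = np.stack(frames)                   # shape (n, h, w)
        weights = 1.0 - np.abs(stack - 0.5) * 2.0  # peak weight at mid-gray
        weights += 1e-6                            # avoid divide-by-zero
        return (stack * weights).sum(axis=0) / weights.sum(axis=0)

    underexposed = np.full((4, 4), 0.05)  # nearly clipped-dark feed
    well_exposed = np.full((4, 4), 0.5)
    print(hdr_composite([underexposed, well_exposed]).mean())  # ~0.46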

FIG. 9 displays the simple arrangement of a smartphone 81 linked to a server 82, with that server feeding an internet channel 83. The internet channel can be public or private, and the phone serves this information in several different ways. The output shown is a display 84. For live purposes, the phone 81 feeds the video to the server 82, which distributes video over the internet 83 to a local device 84 for viewing. The viewer may record their own audio to use with the feed audio, and this too can be shared over the internet via the host server 82.

Later, the owner of the phone 81 may want to watch the video themselves. Assuming the user has a version of the video on the phone that carries the same network time stamp as the video on the server, when they connect their phone to a local display 84 for playback, they may be asked if they want to use any of the supplemental features available on the server 82. Although the server holds lower-quality video than that stored on the phone, it is capable of providing features beyond those possible if the user only has the phone.

This is possible because the video frame 91 is handled and used in multiple ways on the phone 81 and at the server 82. The active stream 92 is encoded for efficient transfer over possibly crowded wireless networks. The encoding may be very good, but the feed will not run at maximum resolution and frame rate. Additional data is included in the metadata stream 93, which is piggybacked on the live stream. The metadata stream is specifically tailored towards enabling functions on the server, such as the creation of 3D mesh models in an online video game engine and the evaluation of related incoming streams to offer options such as those described in FIG. 7. The metadata stream may carry all of the sensor information along with highly structured video information such as edge detection, contrast mapping, and motion detection. The server may be able to use the metadata stream to develop fingerprints and pattern libraries. This information can be used to create the rough vertex maps and meshes on which other video information can be mapped.
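
A sketch of the kind of structured video information the metadata stream 93 might carry alongside the compressed frames, computed here with OpenCV: per-frame edge-density, contrast, and motion summaries. The descriptor names and packaging are illustrative.

    # Per-frame descriptors of the kind the metadata stream might carry.
    import cv2
    import numpy as np

    def frame_descriptors(prev_gray: np.ndarray, gray: np.ndarray) -> dict:
        edges = cv2.Canny(gray, 100, 200)
        return {
            "edge_density": float(np.count_nonzero(edges)) / edges.size,
            "contrast": float(gray.std()),
            "motion": float(cv2.absdiff(gray, prev_gray).mean()),
        }

    # Example on two synthetic frames:
    a = np.zeros((120, 160), np.uint8)
    b = a.copy()
    cv2.rectangle(b, (40, 30), (120, 90), 255, -1)  # a "player" appears
    print(frame_descriptors(a, b))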

When the user hooks their smartphone/device 81 up to the local device 84, they connect the full-resolution video 94 on the smartphone 81 to the video on the server 82. The software on the phone or the software on the local device will be able to integrate the information from these two sources.

FIG. 10 illustrates at a simple level how a vertex map might be constructed. One user with a smartphone 101 makes a video of the game. The video has a specific viewing angle 102. There may be a documented image of the soccer pitch 103 available from an online mapping service. It is possible to use reference points in the image, such as a player 104 or the ball 105, to create one layer 106 in the 3D mesh model. As additional information is added, this map may get richer and more detailed. Key fixed items like the goal post 107 may be included. Lines on the field and foreground and background objects will accumulate as the video is fed into the server.
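
One plausible way to anchor a single camera's layer 106 into the shared model, sketched with OpenCV: given a few field reference points with known positions on the pitch (goal posts, center spot) and their pixel locations in the frame, cv2.solvePnP recovers the camera's pose so its 2D observations can be placed in the 3D mesh. The coordinates and intrinsics below are invented for illustration.

    # Recover one camera's pose from known field reference points.
    import cv2
    import numpy as np

    # World coordinates on the pitch, in meters (z = 0 on the ground plane).
    world_pts = np.array([[0.0, 0.0, 0.0],      # center spot
                          [52.5, 3.66, 0.0],    # right goal, one post
                          [52.5, -3.66, 0.0],   # right goal, other post
                          [-52.5, 3.66, 0.0]],  # left goal, one post
                         dtype=np.float64)
    image_pts = np.array([[640, 360], [1100, 330], [1080, 420], [150, 340]],
                         dtype=np.float64)
    K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # assumed intrinsics

    ok, rvec, tvec = cv2.solvePnP(world_pts, image_pts, K, None)
    print(ok, tvec.ravel())  # rvec/tvec give the camera pose relative to the field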

A second camera 108 looking at the same action may provide additional 2D data which can be layered into the model. Additionally, the camera sensors may help to determine the relative angle of the camera. As fixed points in the active image area become fixed in the 3D model, the system can reprocess the individual camera feeds to refine and filter the data.

FIG. 11 illustrates the transition from the initial video stream to the skinned avatar in the game engine. A person in the initial video stream 111 is analyzed, and skeletal information 112 is extracted from the video. The game engine can use the best available information to skin the avatar 113. This information can come from the video, from game engine files, from the player shots taken in the configuration mode, or from avatars available in the online software. A user may choose a FIFA star, for example. That FIFA player may be mapped onto the skeleton 112.

FIG. 12 illustrates a second angle and the additional information available in it that is not available in the first images illustrated in FIG. 11. The skeleton 122 shows differences when compared to the skeleton 112 in FIG. 11, based on the different perspective. The additional information helps to produce a better avatar 123.

FIG. 13 illustrates a feature showing that once a three-dimensional model has been created, additional virtual camera positions can be added. This allows a user to see a player's-eye view of a shot on goal 131.

FIG. 14 describes the flow of data through the processing system that takes a locally acquired media stream and converts it into content that is available online in an entirely different form. The sources 141 may be aggregated in the server 142, where they may be sorted by event 143. This sorting may be based on time code and GPS data. In an instance where two users were recording images of players playing on adjacent fields, the compass data from the phones may indicate that the images were of different events. Once these events are sorted, the system may format the files so that all metadata is available to the processing system 144. The system may examine the location and orientation of the devices and any contextual sensing to identify the location of the users. At this point external data 145 may be incorporated into the system. Such data can determine the proper orientation of shadows or the exact geophysical location of a goal post. The nature of such a large data system is that data from each game at a specific location will improve the user's experience at the next game. User characteristics, such as repeatedly choosing specific seats at a field, may also feed into improved system performance over time. The system will sort through these events and build initial masking and depth analysis based on pixel flow (movement in the frame) of individual cameras, correcting for camera motion. In this analysis, it may look for moving people and perform initial skeletal tracking as well as ball location.
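
The event-sorting step 143 might be realized along the lines of the following sketch: two streams are grouped into the same event when their time codes overlap, their GPS fixes fall within a common radius, and their compass headings do not point at different fields. The thresholds are assumptions.

    # Group streams into events by time-code overlap, GPS proximity,
    # and compatible compass headings.
    import math

    GPS_RADIUS_M = 200.0     # same venue
    HEADING_SPREAD = 120.0   # adjacent fields produce opposing headings

    def same_event(a, b):
        """a, b: dicts with 't0', 't1' (s), 'xy' (m, local grid), 'heading' (deg)."""
        overlap = min(a["t1"], b["t1"]) - max(a["t0"], b["t0"])
        dist = math.hypot(a["xy"][0] - b["xy"][0], a["xy"][1] - b["xy"][1])
        dh = abs(a["heading"] - b["heading"]) % 360.0
        dh = min(dh, 360.0 - dh)
        return overlap > 0 and dist < GPS_RADIUS_M and dh < HEADING_SPREAD

    s1 = {"t0": 0, "t1": 600, "xy": (0, 0), "heading": 90}
    s2 = {"t0": 30, "t1": 640, "xy": (60, 10), "heading": 80}
    s3 = {"t0": 10, "t1": 615, "xy": (70, 90), "heading": 270}  # adjacent field
    print(same_event(s1, s2), same_event(s1, s3))  # True False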

The system may tag and weight points based on whether they were hard data from the frame or interpolated from pixel flow. It may also look for static positions, like trees and lamp posts, that may be used as trackers. In this process, it may deform all images from all cameras so that they are consistent, based on camera internals. The system evaluates the data by searching all video streams identified for a specific event, looking for densely covered scenes. These scenes may be used to identify key frames 146 that form the starting point for the 3D analysis of the event. The system may start at the points in the event at which there is the richest dataset among all of the video streams and then proceed to work forward and backward from those points. The system may then go through frame by frame, choosing a first image to work from to start background subtraction 147. The image may be chosen because it is at the center of the baseline and because it has a lot of activity.
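
A minimal sketch of the pixel-flow analysis and key-frame selection, using OpenCV's dense optical flow: frame pairs with the most motion energy are treated as the richest starting points 146. The scoring heuristic is an assumption.

    # Score frame pairs by dense optical-flow energy and pick key frames.
    import cv2
    import numpy as np

    def flow_energy(prev_gray, gray):
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        return float(np.linalg.norm(flow, axis=2).mean())

    def pick_key_frames(gray_frames, top_k=3):
        scores = [(flow_energy(a, b), i + 1)
                  for i, (a, b) in enumerate(zip(gray_frames, gray_frames[1:]))]
        return [idx for _, idx in sorted(scores, reverse=True)[:top_k]]

    frames = [np.random.randint(0, 255, (64, 64), np.uint8) for _ in range(5)]
    print(pick_key_frames(frames))  # indices of the highest-motion frames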

The system may then choose a second image, from either the left or right of the baseline, that is looking at the same location and has similar content. It may perform background subtraction on that content. The system may build depth maps of the knocked-out content from the two frames, performing point/feature mapping using the fact that they share the same light source as a baseline. The location of features may be prioritized based on the initial weighting from the pixel flow analysis in step one. When there is disagreement between heavily weighted data 148, skeletal analysis may be performed 149, based on pixel flow analysis. The system may continue this process, comparing depth maps and stitching additional points onto the original point cloud. Once the cloud is rich enough, the system may perform a second pass 150, looking at shadow detail on the ground plane and on bodies to fill in occluded areas. Throughout this process, the system may associate pixel data, performing nearest-neighbor and edge detection across the frame and across time. Pixels may be stacked on the point cloud. The system may then take an image at the other side of the baseline and perform the same task 151. Once the point cloud is well defined and 3D skeletal models are created, these may be used to run an initial simulation of the event. This simulation may be checked for accuracy against a raycast of the skinned point cloud. If filtering determines that the skinning is accurate enough or that there are irrecoverable events within the content, editing and camera positioning may occur 153. Where key high-speed motions, like kicks, have been analyzed, they may be replaced with animated motion. The skeletal data may be skinned with professionally generated content, user-generated content, or collapsed pixel clouds 154. This finished data may then be made available to users 155.
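
The background subtraction 147 and depth-map construction described above might be sketched as follows, using OpenCV's MOG2 subtractor and a semi-global stereo matcher. The pipeline is deliberately simplified (real feeds would first be rectified and the baseline calibrated), so it is a stand-in rather than the system's defined method.

    # Knock out the background in each view, then build a depth map from
    # the masked pair with a stereo matcher.
    import cv2
    import numpy as np

    def make_knockout():
        sub = cv2.createBackgroundSubtractorMOG2()  # one subtractor per camera feed
        def knockout(gray):
            mask = sub.apply(gray)                  # foreground mask from motion
            return cv2.bitwise_and(gray, gray, mask=mask)
        return knockout

    stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)

    def depth_map(left_gray, right_gray):
        disparity = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
        disparity[disparity <= 0] = np.nan          # unmatched pixels carry no depth
        return disparity                            # depth ~ baseline * focal / disparity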

The finished data can be made available in multiple ways. For example, a user can watch a 3D video online based on the video stream they initially submitted. A user can watch a 3D video of the game based on the edit decisions of the system. A user can order a 3D video of the game on a single-write video format. A user can use a video game engine to navigate the game in real time, watching from virtual camera positions that have been inserted into the game. A user can play the game in the video game engine. A soccer game may be ported into the FIFA game engine, for example. A user can customize the game, swapping in their favorite professional player in their own position or an opponent's position.

If a detailed enough model is created, it may be possible to use highly detailed pre-rigged avatars to represent players on the field. The actual players' faces can be added. This creates yet another viewing option. Such an option may be very good for more abstracted uses of the content, such as coaching.

While soccer has been used as an example throughout, other sporting events could also be used. Other applications include any event with multiple camera angles, including warfare or warfare simulation, any sporting event, and concerts.

While the present disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments may be devised which do not depart from the scope of the disclosure as described herein.

CLAIMS

1. A system for creating images for display comprising: a first recording device that records digital images; a server that receives the images from the first device; wherein the server, based on digital image data from a source remote to the server and the first recording device, adds visual content to the received digital images from the first device to create an image for display.

2. The system of claim 1, wherein the server receives GPS information from the first recording device.

3. The system of claim 1, wherein the server receives accelerometer data from the first recording device.

4. The system of claim 1, wherein the server receives sound signal data from the first recording device.

5. The system of claim 1, wherein the data from a source remote to the server comprises digital images received from a second recording device that records digital images.

6. The system of claim 5, wherein the server uses image data received from both the first recording device and the second recording device to create a wireframe image.

7. The system of claim 5, wherein the server includes a video game engine and the image data from the first recording device and the second recording device has been mapped into the video game engine.

8. The system of claim 7, wherein a user can move the recording devices' positions within the video game engine to create new perspectives.

9. The system of claim 5, wherein the first and second recording devices record sound data and the server combines the sound data to create a sound output.

10. The system of claim 5, wherein the server uses image data received from both the first recording device and the second recording device to create a single video stream.

11. The system of claim 2, wherein the server compares metadata from a plurality of recording devices to determine the locations of the recording devices, and the server creates a digital environment based on image data from the plurality of recording devices.

12. A method for creating displayable video from multiple recordings comprising: creating a sensor mesh wherein multiple sensors record video from multiple perspectives; comparing the multiple recorded videos to one another on a server networked to the multiple sensors; and, based on the comparison, creating a video stream comprising data from the multiple perspectives of the multiple sensors.

13. The method of claim 12, further comprising, based on the comparison, creating multiple video streams for display.

14. The method of claim 13, wherein the multiple video streams comprise multiple perspectives.